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Abstract 

We propose a novel method for the determination of the effective interaction 
potential between the amino acids of a protein. The strategy is based on the 
combination of a new optimization procedure and a geometrical argument, 
which also uncovers the shortcomings of any optimization scheme. The 
strategy can be applied on any data set of native structures such as those 
available from the Protein Data Bank (PDB). In this work, however, we 
explain and test our approach on simple lattice models, where the true 
interactions are known a priori. Excellent agreement is obtained between 
the extracted and the true potentials even for modest numbers of protein 
structures in the PDB. Comparisons with other methods are also discussed. 



I. INTRODUCTION 



The prediction of the three dimensional structures of the native state of proteins from 
the knowledge of their sequences of amino acids can only be achieved if the interaction 
potentials among various parts of the peptide chain in the presence of solvent molecules 
are known to some extent. Indeed, the native states of many globular proteins correspond 
to the conformations which are global minima of the free energy . Thus the knowledge 
of the energy of a sequence in a given conformation would be an important step in the 
complete solution of this formidable problem, and also of the inverse one, i.e. the design 
of a sequence of amino acids that rapidly folds into a desired conformation. 
A rigorous approach from "first principles" , taking into account the quantum mechanics 
of the huge number of atoms constituting the protein is not practical and beyond actual 
possibilities. 

An alternative approach consists of introducing a coarse-grained description mainly based 
on lattice models where the peptide chain is a self avoiding walk whose nodes represent 
extremely simplified amino acids. Models of this type have been widely used in the recent 
literature for various goals, ranging from folding dynamics to thermodynamic properties 
of folded states of proteins, see e.g. . 

One of the main difficulties with such simplified representations of the protein chain is the 
fact that an effective interaction Hamiltonian has to be used, which captures the essential 
features of the specific properties that one wishes to describe. For example, it is commonly 
believed that the native states of protein sequences ought to correspond to pronounced 
minima in conformation space ||. In the most commonly used model Hamiltonian, "ef- 
fective" two-body forces between neighboring (in space but not in sequence) amino acids 
are the only interactions that are considered [[J- | 15| . These "effective" forces also take 
solvent induced interactions into account. 

Traditionally, the potential energies of the interactions have been derived from pairing fre- 
quencies of amino acids observed in the native structures contained in the PDB |5|-[7|JT^| . 
The method, known as the quasi-chemical approximation, is widely used and relatively 
easy to implement in such a difficult context. In important recent work, Thomas and 
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Dill |L4| have rigorously tested the underlying assumptions and approximations of the 



quasichemical method. Employing a lattice model with an a priori assigned interaction 
potential, one is able to construct a PDB identifying proteins as amino acid sequences 
having a unique ground state conformation (native state) among all possible conforma- 
tions (this is accomplished by exhaustive exact enumeration for sufficiently small values 
of the protein chain length and/or the number of amino acid classes). Applying the qua- 
sichemical method to several of these exact cases, Thomas and Dill jTJJ demonstrated the 
inadequecies of the method and identified its possible weak points. Indeed, the interaction 
parameters could, in the worst cases, be off by as much as a factor of two and the native 
states of protein sequences of the PDB could be correctly identified onaverage in 84% of 
the cases which is poor, for the simple model employed. 

Recently, we have proposed [pl| a new optimization method for the determination of ef- 
fective potentials - the derived potentials in the model considered by Thomas and Dill 



14| are better than the ones obtained by the quasichemical method, but still do not 
match the true potentials. Nevertheless, these derived potentials allow 100% success in 
the prediction of the native structures! In this work we explain why this is so, and show 
that with the bare information contained in the native structures, no unique value of the 
candidate potentials can be given. Starting from a set of "good" sequences, i.e. sequences 
which have a unique ground state with the true potentials, their corresponding native 
structures, and a set of alternative structures, we can isolate a volume (cell) in the space 
of potentials. All points in this cell are equivalent to the true potential as far as the only 
requirement is that each good sequence has to recognize (i.e. has as the unique ground 
state) its native structure. The volume of the cell decreases as the protein chain length 
increases. In order to identify the most likely point around which the cell shrinks, we have 
to come up with some criterion. Here we propose a new implementation which leaves the 
success of good sequences in finding their own native states unaffected, and improves 
considerably the estimate of the extracted potentials. 

As already stressed in fllfiH , our method has its root in the original proposal by Crippen 



but differs substantially in the implementation [ID|. Our method is general, it can 



be implemented at any desired temperature T (lower than the minimum folding transi- 



tion temperature of the good sequences we are considering), and it does not have any- 
adjustable parameter. The method is explained in Sec. 2 , whereas Sec. 3 contains the 
results for a 2- and 4-class amino acid problem. Since our method is applicable in any 
spatial dimensionality, we have restricted ourselves to various checks in a two-dimensional 
square lattice with chains up to length 16 restricted to lie within a 5 x 5 square. A com- 
parison of our results with those obtained using a recently proposed method of Mirny and 
Shakhnovich ( |]15|| ) is made in Sec. 4 . 

II. THE MODEL 

We consider a set of N s sequences Q a = {cr s } s=lj . ^ each comprised of N amino acids 
cr s = {o" l s }j = i v i 7v- Each sequence cr s is postulated to have a unique native state (assumed 
to be the ground state and denoted by the superscript n) in a spatial conformation T™ 
that is known experimentally or otherwise. The corresponding set of native conformations 
is denoted by fir = {T™} s =i,..,n s - 

We assume that for a given number of amino acid types N a , the effective interaction 
potentials can be written in the form of a symmetric interaction matrix P^, fi, v = 1, .., N a 
and that similarly for each combination of a sequence and a conformation, a symmetric 
contact matrix C(r,cr)^, /i, v = l,..,N a is defined, giving the (effective) number of 
contacts between the different types of amino acids. The energy of a sequence a in 
conformation V is thus given by 

N a 

e(t,<t)= £ p^c(r,oV- (!) 

/Z<1/=1 

There are a gigantic number of spatial conformations a sequence cr can take, which we 
label as r«(cr), and T n (cr) = To(cr) is the experimentally determined native state structure. 
At a temperature T, the probability that the sequence is in one of these conformations is 
simply given by 

Pi(tr) = exp [-(EMa), a) - F(a))/T] , (2) 

where the Boltzmann constant is defined equal to 1, and F(er) is the free energy, 
defined as 
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F(ct) = -Tlog fcexp[-i<?(r^),<r)/T]J . 



(3) 



Because the experimentally observed structure of the sequence is in the conformation 
r n (cr), the value of Po(cr) must be large (> |) at temperatures below the folding transition 
temperature. Indeed, Pq(ct) should be equal to 1 at zero temperature, if the ground state is 
non-degenerate. In recent work, Crippen JXX[] has suggested that even with the knowledge 



of the exact contact potential from which the folding sequences and their unique native 
conformations are determined, one may not be able to correctly select which sequences 
fold to a desired target structure. The resolution |T2j of this puzzle is that the right 



"score" to be maximized in the inverse folding problem is (0), i.e. E(T(cr), er) — F(er) has 
to be minimized, and not just the energy E(T(cr), <x). A key feature of this score is that 
F(cr), the free energy, does not depend on the specific target structure, but only on the 
sequence being considered. Thus, the determination of the exact contact potential is a 
valuable first step for an attack on the protein design problem, even though the currently 
used score needs to be modified. 

In what follows, we describe a zero temperature version (which is appropriate in most 
instances) of such a procedure to extract the exact potentials. Furthermore, we restrict 
ourselves to models where each conformation is a self-avoiding walk whose elementary 
steps join nearest neighbor sites of a <i-dimensional hypercubic lattice (d = 2 in the present 
applications). Amino acids are placed at the nodes of the visted sites, and contacts are 
defined between amino acids in neighboring sites but not next to each other along the 
sequence. 



A. The Method 

Instead of starting immediately with a cost function that has to be minimized, we con- 
centrate for a moment on the space spanned by the interaction potentials. 
Since all energies scale linearly with the amplitude of the interaction potentials, we have 
to keep e.g. the first parameter fixed (Pn — > Po) to set a scale. Relabelling the remaining 
parameters P M „ — > p = {Pi}j=i,..,jv p (N p = ±N a (N a + l)-i), and renumbering the contacts 
accordingly, we can rewrite (|l|) as 
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E(T, tr) = Y^PiCiiY, <r) + P c (i» = p ■ c(T, <x) + P c (i» . (4) 

i=i 

The fact that a sequence has a lower energy in its native conformation than in any 
alternative conformation, provides a linear inequality (or hyperplane) in the parameter 
space separating the space into allowed and forbidden halfspaces. We define: 

J,(r al \ rj, <r.) = c 4 (r al \ a.) - Ci (T n s , a.) . (5) 

The allowed points in parameter space have to satisfy the linear inequality 

Y^Pih + P I = p- 1 + P Io > . (6) 

i=l 

Repeating this operation for all the sequences in Q a and for all the alternative confor- 
mations, and retaining only the allowed part of the parameter space that satisfies all the 
inequalities, we obtain a convex cell around the target parameters. This cell contains all 
the points that yield the correct native conformation as the unique ground state for each 
of the sequences in Q a . In the test model, we have generated the set Q a using the energy 
function ([I]), and therefore, the existence of the cell is guaranteed and the problem is well 
posed. For real proteins the form of the energy function is an ansatz that is tested by the 
(non)existence of a finite cell. 

Each inequality corresponds to a hyperplane in parameter space separating allowed and 
forbidden half-spaces. The orientation of the hyperplane is given by /, the offset from 
the origin by -Po^o- The distance of any point p in parameter space to this hyperplane, 
is related to the energy gap between the two configurations leading to this inequality (at 
the value of parameters given by p) by the following linear equation: 

|p-/ + P /o| gap(p,J) 
d(p, I) = = r—^ ■ (7) 



Using all the information in the data set, the cell is maximally reduced. A selection 
procedure is needed in order to isolate an optimal point within the cell. Instead of the 
cost functions used in |Tj|, the optimal interactions are chosen such that the smallest 



gap among all the sequences in the training set is as large as possible. The cost function 
(.Fg a p(P)) is hence taken to be minus the smallest gap, i.e. 
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min min {E(T, a s ) - E(T", cr s 



(p- f(r*, r™ , o>) + P I (F, T",^)) 




where ov and T* are the sequence and the alternative conformation respectively that 
yield the minimum gap. To reiterate, the interaction potentials are chosen in such a way 
that the maximum minimum (mxm-) gap is obtained. For similar ideas see also pj. 
This cost function has two major advantages over previous attempts. First, it automati- 
cally ensures that all sequences have their unique groundstates in the correct structures. 
In fact, a negative mxm-gap would imply that the a priori assumption of the form of the 
energy function ([I]) is incompatible with the data in the training set. 
Second, it does not suffer from an unphysical bias due to statistical fluctuations that 
were present in the cost functions proposed in JTJ|. These cost functions not only make 
use of all inequalities, but also of the number of occurences of each inequality over the 
training set. Therefore, it may happen that inequalities that occur more frequently, push 
the optimal parameters away from their true values. One may expect all sequences in 
the training set to satisfy the minimimum conditions that make them good folders, which 
implies that each inequality is equally important, irrespective of the number of times it 
occurs. In realistic cases it may be important to rescale the energy gap associated with 
a given sequence with respect to its ground state energy. Because, in this work, we have 
sequences of the same length in a given training set, all ground state energies are roughly 
the same, and rescaling of the gap is not necessary. 

Because the width of the energy gap is not equal to the distance in parameter space 
([?]), inequalities that do not contribute to the boundaries of the cell may still influence 
the mxm-gap. Nevertheless, for all the cases that we have encountered, using only the 
inequalities contributing to the cell, we obtain exactly the same optimal parameters as 
the ones derived by using all inequalities. This facilitates our maximization procedure. 
The sequences have to be put on each alternative conformation only once before the op- 
timization procedure, and we retain only those inequalities that contribute to the cell. 
Then we start our optimization procedure with only those few inequalities. Given the 
inequalities with respect to which to optimize, the design of an optimization procedure 
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is straightforward, because the gradient of each inequality is given by /. Therefore, each 
step in the optimization procedure consists of the following replacement 

p->p + 7 /(r,r»,o>) , or (9) 

P-^P + Jlxnin, (10) 

where 7 is a parameter that can be tuned to obtain fast convergence. Form (|9]) of the 
optimization algorithm is used when the equalities have to be recalculated putting each 
good sequence on the alternative conformations, while form (|T0|) is used when the set of 
important inequalities is already known. In that the inequality from this set 

that yields the minimum gap. 

III. RESULTS 

The method has been tested on models with an increasing number of interaction param- 
eters to check the dependence on the dimensionality (N p ) of the parameter space. The 
test has been done on the normal H-P model [|17]], N a = 2, with nearest neighbor (nn) 
interactions, i.e. with 2 free parameters (N p = 2), and some variations like considering 
next to nearest neighbor interactions, to check the robustness of the method. 
Furthermore, the method has been applied to models with 4 types of amino acids (N p = 9) 
and nn interactions. For the latter we have also studied the dependence of the quality of 
the obtained parameters on the size of the PDB and of the set of alternative structures. 

Although still feasible up to parameter numbers as high as N p = 9, increasing the number 
of interaction parameters and thus the dimension of the cells, reveals the tendency that 
the advantage of putting the good sequences on the alternative structures only once, will 
be annihilated by having to calculate too many cornerpoints. 

With increasing dimension, the number of inequalities contributing to the cell grows lin- 
early, while the number of cornerpoints of the cell, however, grows exponentially. There- 
fore, one may have to opt for a hybrid method. The first step consists of a rough op- 
timization recalculating the inequalities at each update (||). Once a point in parameter 
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space satisfying all constraints (i.e. in the cell) has been obtained, we select and save the 
inequalites which are at a distance less than some tolerance parameter from this point. 
The number of such inequalities is relatively low and it grows linearly with the number of 
parameters. Then, we fully optimize only with respect to these inequalities (TO). An im- 
plementation of this method on the model with N p = 9 for different choices of parameters 
shows that we roughly need 20-30 updates for the first step. Then, we need to typically 
save of the order of 100 inequalities to perform the second step. 

This hybrid method is very efficient because it uses the insights in the geometry of the 
cells in parameter space and avoids unnecessary time loss due to exactly calculating the 
cell. 



A. Results for the H-P model 

In order to be able to compare our results with those of previous work, the first model, 
that we study extensively, is the H-P model introduced by Dill and co-workers (171], which 



has 2 types of amino acids. A contact is defined to be 1 for a nearest neighbour con- 
tact, as long as the two amino acids are next to each other along the sequence, and 
otherwise. To fix the energy scale, we have choosen to fix the parameter (Ehh = Pq)- 
Hence, we are left with two independent parameters (Epp = p% and Ehp = P2) and 
a 2 dimensional parameter space, which allows us to clarify our reasoning with in- 
structive pictures. We have considered three types of target interaction parameters: 
{E HH , Epp, E HP ) = (-1,0,0), (-1, -l/y/2 ~ -0.707, 0), (-2,-2,-1), and seven dis- 
tinct groups of amino acid chains each of length: N = 10, .., 16. 



In order to reduce the necessary computer time, we have taken all semi-compact 2 dimen- 
sional conformations on a square lattice. By semi-compact, we mean that we restrict the 
conformations to a box of size 5x5 (Tests with all conformations of a certain length on 
a square lattice show that the results are unaltered). 

As alternative conformations, we have considered both the set of good conformations 
f2p° od (having at least one sequence that has its unique native state in it), and the set of 
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all conformations flf 1 (obtained by complete enumeration and also used to generate the 
good sequences). Although good results can be obtained considering only f2p° od , it is not 
excluded (and is indeed observed) that extra information can be gained (in the sense of 
new inequalities, further reducing the cell) by also considering new conformations from 
f2p U . However, when the cells are closed, in all cases we obtain exactly the same set of 
optimized parameters. Since some inequalities are very rare, it is difficult to say how 
many alternative conformations are needed to maximally reduce the cell. Unfortunately, 
so far we have not found a criterion to determine beforehand whether a certain sequence 
and a given alternative conformation gives rise to a "tight" inequality (contributing to 
the boundaries of the final cell) or not. Therefore, although the obtained parameters are 
relatively stable to changes in the shape of the cell, the best strategy seems to be to use 
as much information as is available, or as is numerically feasible. It cannot be excluded 
that regenerating all the good sequences with the newly obtained parameters, would add 
some new "good" sequences to the set. 



target 


0.0 


0.0 










N 


Ep P 


E HP 


truegap 


mxmgap 


vol(ftf) 


vol(^ ood ) 


10 


1 


1 


1.0 


/ 


0.0° 


2.034444° 


11 


0.0 


0.0 


1.0 


1.0 


1.062500 


0.643501° 


12 


0.0 


0.0 


1.0 


1.0 


0.916667 


0.643501° 


13 


0.0 


0.0 


1.0 


1.0 


0.625000 


0.0° 


14 


0.0 


0.0 


1.0 


1.0 


0.444444 


0.0° 


15 


0.0 


0.0 


1.0 


1.0 


0.272817 


1.203704 


16 


0.0 


0.0 


1.0 


1.0 


0.252315 


0.611111 



(3.1.a) 
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target 


-0.707107 


0.0 










N 


E PP 


Ehp 


truegap 


mxmg 


;ap 


vol^f; 11 ) 


vol(nf od ) 


10 


1 


1 


0.12132 


/ 




0.0° 


0.321751° 


11 


-5/7 


0.0-0.10 


0.12132 


0.142£ 


356 


0.034722 


0.034722 


12 


-5/7 


0.0-0.04 


0.12132 


0.142£ 


356 


0.017361 


0.029514 


13 


-5/7 


0.0 


0.12132 


0.142£ 


356 


0.030556 


0.030556 


14 


-5/7 


0.0 


0.12132 


0.142J 


356 


0.015129 


0.019097 


15 


-5/7 


0.0 


0.12132 


0.142J 


356 


0.007955 


0.007955 


16 


-5/7 


0.0 


0.12132 


0.142^ 


356 


0.007955 


0.007955 



(3.1.b) 



target 


-2.0 


-1.0 










N 


Epp 


Ehp 


truegap 


mxmgap 


vol^f; 11 ) 


voi(nf od ) 


10 


1 


1 


1.0 


/ 


0.0° 


1.249046° 


11 


-2.0 


-1.0 


1.0 


1.0 


0.733333 


0.785398° 


12 


-2.0 


-1.0 


1.0 


1.0 


0.733333 


0.785398° 


13 


-2.0 


-1.0 


1.0 


1.0 


0.900000 


0.566729° 


14 


-2.0 


-1.0 


1.0 


1.0 


0.733333 


0.566729° 


15 


-2.0 


-1.0 


1.0 


1.0 


0.357143 


0.554762 


16 


-2.0 


-1.0 


1.0 


1.0 


0.215320 


0.334641 



(3.1.c) 



From the tables |3.1.a| , |3.1.b| , |3.1.c| and Fig.l, we see that the volume of the cells tends to 
decrease montonically with increasing sequence length. The only exceptions are observed 
with length N = 13, but are probably due to finite size effects. In two cases, a segment 
of a line of points in parameter space yields the same mxm-gap. 

In all the cases that we have considered, and where the ratios of the target potentials are 
rational numbers made up out of small enough integers, the maximization of the min- 
imum gap renders the exact potentials. Furthermore, we observe that all the obtained 
parameters are rational, even if the target parameters are not, due to the fact that in our 
model only an integer number of contacts is possible. It also explains the fact that for 
the target parameters (— 1, — l/y/2, 0), the obtained parameter E PP is invariably —5/7 
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for all sequence lengths N = 11, ..,16 although the cell changes drastically. One would 
have to consider (much) longer sequences to get contact numbers high enough to generate 
a rational number closer to —l/y/2. This insensitivity may be lifted in cases where the 
number of contacts is no longer integer, e.g. for real space proteins. 
For this set of target parameters, we have also considered taking only those good sequences 
with a minimum gap larger than certain threshholds (i.e. 0.5 and 0.75), and although 
the obtained cells are larger (it scales with (mingap)^), the obtained parameters are 
unaltered untill the cell ceases to be closed. 



To get an idea of the performance of the algorithm as the dimension of parameter space 
increases, we did some checks on the following variations: 

-a model with N p = 3, with 2 kinds of amino acids as before (N a = 2), but including a 
next to nearest neighbor (nnn) 
interaction for a H-P contact, 
-a model with N p = 5, with 2 kinds of amino acids (N a = 2) and nn and nnn contacts, 
-a model with N p = 5, with 3 kinds of amino acids (N a = 3) and only nn-contacts, and 
- models with N p = 9, with 4 kinds of amino acids (N a = 4) and only nn-contacts, see 
Sec. 3.2 . 

The quality of the obtained parameters is always as good as those shown in tables 
3 . 1 . a J3 . 1 7b| . 3 . 1 . c and does not depend on N p . 

To check the sensitivity of the method to a wrong choice of energy function, we have 
generated good sequences and structures using 6 interaction parameters (N p = 5, both 
N a = 2, nn- and nnn-contacts and N a = 3, nn-contacts), and tried to satisfy all inequal- 
ities using fewer parameters, e.g. ignoring nnn H-P contacts. The method immediately 
indicated that the cell does not exist, and thus that the number of parameters was insuf- 
ficient. On the other hand, putting in more free parameters than were used to generate 
the good sequences, the irrelevance of these parameters is immediately recognized and 
their obtained values are (very close to) . 
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B. Results for the 4 amino acids problem 



The P^v matrix has 10 independent parameters in this case. Our tests have been 
carried out for four different sets of parameter values where each parameter is generated 
independently from a Gaussian distribution with mean —2 and variance 1. The length 
of the chain is 14. For each set of parameters, we have generated a PDB of about 600 
sequences and their corresponding (unique) native states. Furthermore, the sequences 
have an energy gap, A (the energy difference between the first excited state and the 
ground state) greater then 0.5. Indeed it is thought |2j that real proteins in order to have 
thermodynamical stability and short folding times should possess a pronounced global 
minimum on the potential surface. A comparison with one case where A > 2 is also 
presented. The trial Hamiltonian is parametrized as the true one and we have chosen the 
energy scale by fixing to its exact value one of the most negative -P^'s. The remaining 9 
parameters are then determined maximizing the minimum gap using the method explained 
above. We have also verified that simulated annealing techniques are quite efficient for this 
case and give the same set of extracted potentials as the method used in § |III A| . Figures 
2a,b,c and d show the extracted potentials versus the true ones. The extracted potentials 
are then tested for new sets of "good" sequences for each of the four cases to determine 
their ground state configurations over all possible self avoiding chains of length 14. For all 
the sequences in the PDB, we get full success (Figure 2). Indeed, since the maximun gap 
has been calculated on a restricted set of conformations there is no guarantee a priori that 
the good sequences used in the optimization procedure recognize their own native state 
among all possible conformations. Thus it is important to test the extracted potentials 
using a new independent set of good sequences. In all four cases that we studied, at 
most 2 out of 628 do not found their original native state. The percentages of the correct 
determination of the native states using the extracted potentials are indicated in the table. 
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parameter set 


size of the PDB 


success 


1 


628 


99.7% 


2 


716 


99.9% 


3 


840 


100% 


4 


798 


100% 



We have tested the performance of the method as the size of the PDB is decreased. Figure 
3 shows how the percentage of success depends on the number N of sequences contained 
in the PDB. Only three of the four cases are shown for clarity (the fourth case has the 
same behaviour as the other three). The minimum N used is 14. Note that full success 
is almost reached for N ~ 200 — 300. For the first set of potential parameters, we have 
also generated a PDB with an energy gap A > 2. The results are shown in fig. 3 , and 
saturation is reached at about iV ~ 100. 

IV. COMPARISON WITH OTHER METHODS 

A. The quasi-chemical method 

The quasi-chemical method [5-7] is widely used in various forms for obtaining the ef- 
fective potential between aminoacids and to provide "scores" for candidate protein struc- 
tures. Briefly, the procedure is as follows: from the databank, one compiles the probability 
density, /a,b(^), that two specific aminoacids are at a distance r from each other. This 
quantity is a normalized one, and takes into account how often the individual aminoacids 
appear in the data base. The basic idea is that if A and B like each other, they are more 
likely to be near each other compared to a random reference state of a non-interacting gas 
of aminoacids. Conversely, if A and B dislike each other, they avoid each other a bit more 
than what one would expect from random considerations. This idea is then quantified in 
the form 

E A , B (r)<x-kT\n[f A , B (r)\ . (11) 

Additional considerations pertaining to how far apart two aminoacids are along the se- 
quence are sometimes introduced in order to build in the correct secondary structure. 
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The derived quantities such as Ea,b are now interpreted as the energies of interaction 
between the aminoacid pairs and used for determining which structure among many al- 
ternatives yields the most favorable value of the energy. The results using this method as 
obtained by Thomas and Dill |TJ] are shown in table [O, and have to be compared with 
the corresponding cases in tables |3.1.a| , |3.1.b|]3Tl~c| obtained by our method. 



True 


TD test [14] 


Ehh 


Ehp 


Epp 


Ehh 


Ehp 


Epp 


success 


-5 


-4 


-1 


-5 


-3.0 


+0.8 


74% 


-5 


-1 


-2 


-5 


-1.1 


-2.1 


100% 


-5 


-5 


-1 


-5 


-3.7 


+1.4 


84% 


-5 


-3 


+ 1 


-5 


-2.6 


+2.5 


96% 


-5 


-3 


-1 


-5 


-2.4 


0.0 


64% 



(4.1) 



The rationalization for deriving the interaction energy from the observed pairing frequency 
has been stated to be Boltzmann's principle and also has been called the Boltzmann de- 
vice. Boltzmann statistics pertains to the occupation probabilities for the energy levels of 
an individual system. Thus, if a system can have energies E , E\, E 2 , etc., the probability 
that the system has an energy Et is proportional to exp[— ^]. 



We repeat several observations made in Seno et al. ]T6| . (Thomas and Dill |T4j have 
also presented an important critique of the quasichemical method.) First, the native 
structures of distinct sequences of aminoacids do not correspond to the excitations of a 
single system. Instead, each of the sequences is a separate system, whose native state 
structure is known from experiment. Thus, the basic premise of the method is wrong. 
Second, even making the assumption that Boltzmann statistics did hold, there is no sim- 
ple relationship between the observed pairing frequency and the energy of interaction as 
envisaged by ([TT|). The role of temperature in (|Tl"|) is unclear, because the native states 
of each of the sequences correspond to their ground states or equilibrium states at T = 0. 
Third, the quasichemical method relies on a reference state - the observed pairing frequen- 
cies are compared to those expected in this reference state in order to determine whether 
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two amino acids like each other and by how much. Often, this reference state is chosen 
as a noninteracting gas made up of all the aminoacids constituting all the sequences with 
known native structure. This does not seem to have a physical basis, because the se- 
quences are all distinct entities and do not originate from a common soup of aminoacids. 



These difficulties with the quasichemical method, which were already partially recognized 
in the literature (see ref. |TJ| and references therein), are avoided in our strategy. The 
sequences whose structures are known are analogous to quenched variables in statistical 
mechanics while the conformations that a given sequence can adopt, are the analog of 
annealed variables. A thermodynamic average can be performed over the annealed vari- 
ables but not over the quenched ones. We use Boltzmann statistics but for each sequence 
separately. We deal with the energies directly and not with a derived quantity such as the 
pairing frequency. Indeed, our strategy embodies the complete information in the system 
and, in principle, has information not only about pairing frequencies but also triplet and 
higher order correlations. Our method does not rely on a reference state and the role of 
temperature is well-defined. 



B. Mirny and Shakhnovich's method 

Recently, Mirny and Shakhnovich (MS) ]nj have proposed to use the Z-score pi 
which is a measure of how pronounced the energy minimum corresponding to the native 
state is, to carry out protein design. The Z-score is given by: 

v ' var(£) v ; 

where the average of E, (E), and its variance, vai(E), are computed for a set of alternative 



(decoy) conformations. The method of |15| entails the minimization of the cost function 
(harmonic mean) 

(Z) h&rm = ( £ Z(a)-y . (13) 
0"efi CT 

For each conformation T of the ensemble Q a , the average (E) and the variance vax(E) are 
calculated in an ensemble of phantom conformations with the same number of residue- 
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residue contacts as in T and with the assumption that these contacts occur independently 
of each other. This approximation will be discussed further later on. One can repeat 
these calculations for our cases. 

We have implemented this method using (|T2| ) and (|13"D, calculating (E) and vai(E) ex- 
actly for each sequence using the structures of the PDB. 

In order to have a finite minimum of (Z)harm, it is necessary to fix the variance, or equiv- 
alently one of the interaction parameters like we did. MS also fix the average potential, 
which requires more information, which, apriori, one does not have. In the case of many 
interaction parameters, however, this might not be so crucial. Furthermore, in their im- 
plementation, MS do not explicitly require that all the Z(cr)'s be negative, since (E) and 
vai(E) are approximated. For the H-P model, we first require that Z(cr) < for all cr in 
the PDB, and then we minimize (Z)harm- 
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-0.707107 


0.0 
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Ep P 


Ehp 




^wrong 


success 


12 


-0.969868 


-0.747827 


0.010588 


728 


4 


99.45% 


13 


-0.990930 


-0.719754 


0.003577 


750 





100.0% 


14 


-0.977091 


-0.738392 


0.008377 


2005 


26 


98.70% 


15 


-0.963963 


-0.755398 


0.012254 


4302 


77 


98.21% 


16 


-0.972729 


-0.744113 


0.009735 


8892 


151 


98.30% 



true 


-1.0 


-0.707107 


0.0 
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Ehh 


Epp 


Ehp 




^wrong 


success 


12 


-0.655207 


-0.330015 


0.474504 


728 


124 


82.97% 


13 


-0.841471 


-0.317697 


0.568875 


750 


198 


73.60% 


14 


-0.745093 


-0.317782 


0.526300 


2005 


675 


66.33% 


15 


-0.657765 


-0.327509 


0.477554 


4302 


1754 


59.23% 


16 


-0.740636 


-0.328271 


0.517735 


8892 


3274 


64.00% 



For the 4 amino acid case, the minimization of ( |I~3"D leads to spurious minima correspond- 
ing to Y<(t Z{ (T ) 1 — 0, since both positive and negative Z(cr)'s appear. This happens 
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both with the exact (E) and vax(E) , and with the approximations of MS Hl5 |. 
Since the search of the parameter domain where all the Z(cr)'s are negative was imprac- 
tical in this case, we have modified ([13]) to (| ZD harm = (J2cr |^(f) I x ) 1 • 



Tables |4.2.a|,p~2T5| show the results for one of the H-P cases we have considered before, i.e. 



Ehh = — 1, Ehp = —l/y/2 and Epp = . Table ^4.2.a| corresponds to the case where we 



have fixed both the variance and the average of the £"s, leaving only one parameter to 
be determined. Table [4.2.b| shows the results when only the variance of the interaction 



parameters is fixed to set the energy scale, and two parameters are left to be determined, 
as in our method. 

With the parameters obtained from the minimization we checked how many of the good 
sequences of the PDB still have their unique ground state in the correct conformation 
among all the possible conformations obtained by exact enumeration. In contrast to our 
method, not all of the sequences in the PDB find the correct native state, as can be seen 
in the tables. Note that neither the success rate, nor the values of the potentials are 
monotonic as a function of the chain length, at least within the small range of lengths 
used. 



Table [4.2.cj shows the results for the 4 amino acid case for the same four sets of po- 



tential parameters used to test our method. The variance and the average potential have 
been fixed to their exact values (thus there are 8 free parameters and not 9 as in our case). 
For one parameter set we have also implemented the MS optimization method. MS use 
the following expressions for {E) and var(.E') (see the discussion following eq.|l3| ): 

(E) Y^l' il 4l (A,,) (14) 

i<j 

i<j k<l 

with 

(Am) = (16) 
'hot 

and 
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T 



1 

'lot 



ij,kl 



(17) 



_L 1 

ntot n^~ t 



where n is the number of contacts in the native conformation, n to t is the total number of 
the topologically possible contacts and the indices i, j, . . . run from 1 to the length of the 
chain (14 in our case). Using the MS hypothesis, we found a different expression for T^-jy: 

n(n-l) _ n 2 



T 



ij,kl 



n tot (n tot -l) M) ^ (M) 



n n 



The results corresponding to both assignments, (fl8|) and (|17D , are also reported in table 



4.2.c| and should be compared with the results of our method in table |3.2| . Figures 4. a, 
b, c and d are the analogs of fig. 2. a, b, c and d for the MS method. Figure 4. a shows 
the extracted potentials using both the exact (E) and var(E') and the approximation (|T6|) 
and (^) (which according to table [4.2.cj works better than the MS one, i.e. (^) and (|T7|)) 
for parameter set 1. 



parameter set 




i^wrong 


success 


1 


628 


64 


89.8 % 


2 


716 


80 


88.2 % 


3 


840 


14 


98 % 


4 


798 


96 


88 % 


1 (using eq 18) 


628 


71 


88 % 


1 (using eq |i~7l) 


628 


105 


83 % 



(4.2.c) 
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figure captions 

Fig.l The 2 dimensional cells for the target parameters (Ehh = —l,Epp = Q,Ehp = 0) (in units 
of kpT), using all the structures as alternatives, for different sequence lengths. Indicated are the cell 
(shaded area), the target parameters (fat point), the sequence length (10, 12, 14, 16), the volume (if the 
cell is closed) or the opening angle alpha of the allowed area. 



Fig. 2 Derived potential versus true potential for the 4 aminoacid problem (in units of kpT). Results 
for the parameter set 1 (a), 2 (b), 3 (c) and 4 (d). 



Fig. 3 Effect of the PDB size on the success rate using the extracted parameters to determine the 
ground state configurations for new sets of sequences, for different minimum energy gaps A (in units of 
fceT). For the case with A > 2 (open triangles), the cell is not closed for small N. When the cell is 
closed, the success rate is almost 100 



Fig. 4 Derived potential versus true potential (in units of kpT), for the 4 aminoacid problem using 
the MS method. Results for the parameter set 1 (a), 2 (b), 3 (c) and 4 (d). Fig. l.a also shows the 
results using the approximation ( |lrj| ) and ([l8|) (open circles). 



table captions 



Tab. 3.1.a , 3~7l.b| , |3.1.c| Results for the H-P model fixing Ehh to its true value to set the energ; 



y 



scale (in units of kpT). The table shows the sequence length, the obtained interaction parameters, the 
true minimum gap with the target parameters, the obtained mxm-gap, the volume of the cell (or opening 
angle in cases in which the cell is not closed) both using all conformations and only the "good" ones 
as alternative conformations. The success rate in the prediction of native conformations of the "good" 
sequences with the obtained parameters is 100% in all cases that the cell is closed. 
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Tab. 3.2 Results for the 4 aminoacid model. For each parameter set, the table shows the size of the 
PDB used for the derivation of the potential and the success rate in the correct prediction of the native 
state for each of the training set sequences. 



Tab. 4^1 Summary of the results of TD |L4j using the Miyazawa-Jernigan scheme H, for a sequence 
length of 14 monomers. The table shows the true parameters, the parameters obtained by TD and the 
success rate in the prediction of native conformations of the "good" sequences with the obtained param- 
eters. 



Tab. 4. 2. a Results for the H-P model, using the Z-score of MS |lq] , fixing both the variance and the 
average of the interaction parameters to their true values. The table displays the sequence length, the 
derived interaction parameters, the total number of good sequences, the number of sequences for which 
the predicted ground state is wrong, and the success rate. 



Tab. 4.2. b Same as Tab. 4.2. a, but only fixing the variance of the interaction parameters. 



Tab. 4.2.C Results for the 4 aminoacid model, using the MS method [la], fixing both the variance 
and the average of the interaction parameters to their true values. The table shows the identification of 
the parameter set, the size of the PDB used for the derivation of the potentials, the number of failures 
in the prediction of the correct native state, and the success rate. In the top 4 lines, the exact (E) and 
var(E) are used, whereas for the fifth line the approximation ( jHj ) and ([L8|), and for the sixth line the 
approximation (^) and ( p"7| ) are used. 
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